Understanding Transformer Predictions Through Memory Efficient Attention Manipulation

Neural Information Processing Systems

Most crucially, they require prohibitively large amounts of additional memory, since they rely on backpropagation, which allocates almost twice as much GPU memory as the forward pass. This renders it difficult, if not impossible, to use explanations in production.
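To make the memory claim concrete, the following is a minimal sketch (not code from the paper) that compares peak GPU memory for a forward pass alone against a forward pass followed by backpropagation. The model, tensor sizes, and use of torch.cuda memory counters are illustrative assumptions and require a CUDA-capable GPU.

import torch

# Hedged illustration only: peak GPU memory of a forward pass with and
# without the buffers needed for backpropagation.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.ReLU(),
    torch.nn.Linear(4096, 4096), torch.nn.ReLU(),
    torch.nn.Linear(4096, 4096),
).cuda()
x = torch.randn(512, 4096, device="cuda")

# Forward only: intermediate activations can be freed as soon as the next
# layer has consumed them.
torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    model(x)
forward_peak = torch.cuda.max_memory_allocated()

# Forward + backward: activations are retained for the backward pass and
# gradient buffers are allocated on top, roughly doubling the footprint.
torch.cuda.reset_peak_memory_stats()
model(x).sum().backward()
backward_peak = torch.cuda.max_memory_allocated()

print(f"forward only: {forward_peak / 2**20:.0f} MiB, "
      f"with backward: {backward_peak / 2**20:.0f} MiB")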



A Technical Proofs

Proof of Proposition 4.1. Using the chain rule, (1), and the definitions of null

Neural Information Processing Systems

This appendix presents the technical details of efficiently implementing Algorithm 2.

B.1 Computing Intermediate Quantities. We argue that in the setting of neural networks, Algorithm 2 can obtain the intermediate quantities ζ. Algorithm 3 gives a subroutine for computing the necessary scalars used in the efficient squared norm function of the embedding layer (see the sketch following this excerpt).

Algorithm 3: Computing the Nonzero Values of n. In the former case, it is straightforward to see that we incur a compute (resp.

F.1 Effect of Batch Size on Fully-Connected Layers. Figure 4 presents numerical results for the same set of experiments as in Subsection 5.1, but for different batch sizes |B| instead of the output dimension q. Similar to Subsection 5.1, the results in Figure 4 are more favorable towards Adjoint compared to GhostClip.
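The B.1 excerpt above refers to an efficient squared-norm computation for the embedding layer. The paper's Algorithm 3 is not reproduced here; the following is a hedged sketch of one standard way to obtain per-sample squared gradient norms of an embedding weight without materializing the per-sample gradients, exploiting the fact that only rows indexed by tokens appearing in a sample are nonzero. All names (embedding_grad_sq_norms, token_ids, grad_out) are illustrative, not the paper's.

import torch

def embedding_grad_sq_norms(token_ids, grad_out):
    # token_ids: (B, T) integer token indices fed to the embedding layer.
    # grad_out:  (B, T, d) gradient of the loss w.r.t. the embedding outputs.
    # Returns a (B,) tensor with the squared Frobenius norm of each sample's
    # gradient w.r.t. the embedding weight matrix, without building that
    # (potentially huge) per-sample gradient explicitly.
    B, T, d = grad_out.shape
    sq_norms = torch.zeros(B, dtype=grad_out.dtype, device=grad_out.device)
    for b in range(B):
        # Repeated tokens within a sequence contribute to the same weight row,
        # so accumulate output gradients per unique token before taking norms.
        uniq, inv = torch.unique(token_ids[b], return_inverse=True)
        row_grads = torch.zeros(uniq.numel(), d,
                                dtype=grad_out.dtype, device=grad_out.device)
        row_grads.index_add_(0, inv, grad_out[b])
        sq_norms[b] = row_grads.pow(2).sum()
    return sq_norms

# Example with a dummy upstream gradient:
ids = torch.randint(0, 1000, (8, 16))
g = torch.randn(8, 16, 64)
print(embedding_grad_sq_norms(ids, g).shape)  # torch.Size([8])

The per-sample loop is kept for clarity; a batched variant would trade it for scatter-style indexing, which is the kind of bookkeeping Algorithm 3 appears to address.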